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Abstract 

Modern technologies are producing a wealth of data with complex structures. For 
instance, in two-dimensional digital imaging, flow cytometry, and electroencephalogra- 
phy, matrix type covariates frequently arise when measurements are obtained for each 
combination of two underlying variables. To address scientific questions arising from 
those data, new regression methods that take matrices as covariates are needed, and 
sparsity or other forms of regularization are crucial due to the ultrahigh dimensionality 
and complex structure of the matrix data. The popular lasso and related regulariza- 
tion methods hinge upon the sparsity of the true signal in terms of the number of its 
nonzero coefficients. However, for the matrix data, the true signal is often of, or can be 
well approximated by, a low rank structure. As such, the sparsity is frequently in the 
form of low rank of the matrix parameters, which may seriously violate the assumption 
of the classical lasso. In this article, we propose a class of regularized matrix regres- 
sion methods based on spectral regularization. Highly efficient and scalable estimation 
algorithm is developed, and a degrees of freedom formula is derived to facilitate model 
selection along the regularization path. Superior performance of the proposed method 
is demonstrated on both synthetic and real examples. 

Key Words: Electroencephalography; multidimensional array; Nesterov method; nuclear 
norm; spectral regularization; tensor regression. 



1 Introduction 

Modern scientific applications are frequently producing data sets where the sampling unit is 
not in the form of a vector but instead a matrix. Examples include two-dimensional digital 
imaging data, which contain the quantized brightness value of a given color at a number of 
rows and columns of pixels; and flow cytometric data, which consist of the fluorescence inten- 
sity of multiple cells at multiple channels. Our motivating example is a study of an electroen- 
cephalography (EEG) data of alcoholism (http://kdd.ics.uci.edu/datasets/eeg/eeg.data.html). 
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The study consists of 122 subjects with two groups: an alcoholic group and a normal control 
group, and each subject was exposed to a stimulus. Voltage values were measured from 64 
channels of electrodes placed on the subject's scalp for 256 time points, so each sampling unit 
is a 256 x 64 matrix. It is of scientific interest to st udy the associa tion between alcoholism 
and the pattern of voltage over times and cha nnels (]Li et all 120101 ) . The generalized linear 
model (GLM) fjMcCullagh and Nelderl . Il983l ) offers a useful tool for that purpose, where 
the response Y is the binary indicator of alcoholic or control, and the predictors include 
the matrix-valued EEG data X and possible covariate vector Z such as age and gender. 
However, the classical GLM deals with a vector of covariates, and the presence of matrix 
type covariates poses fresh challenges to statistical analysis. First, naively turning a matrix 
into a vector results in an exceedingly large dimensionality; for instance, for the EEG data, 
the dimension is p = 256 x 64 = 16, 384, whereas the sample size is only n = 122. Second, 
vectorization destroys the wealth of structural information inherently possessed in the ma- 
trix data; e.g., it is commonly expected that the voltage values of the adjacent time points 
and channels are highly correlated. Given the ultrahigh dimensionality and the complex 
structure, regularization becomes crucial for the analysis of such data. In this article, we 
propose a novel regularization solution for regression with matrix covariates that efficiently 
tackles the ultrahigh dimensionality while preserving the matrix structure. 

A large variety of regularization methods have been developed in recent years. Among 
them, penalization has been playing a powerful role in stabilizing the estimates, improving 
the risk property, and increasing the gen eralization power in classical regressions. Popu - 
lar pen alization techniqu es include lasso (Tibshirani, 



1996; 



SCA D (IFan and Lil . l200ll ) , fused lasso ( ITibshirani et al.l . 120051 ) , elastic net (IZou and Hastie 



Donoho and Johnstond. Il9 94). 



20051 ). and many others. For regressions with matrix covariates, a direct approach is to 



first vectorize the covariates then apply the classical penalization techniques. Regularization 
helps alleviate the problem that the dimensionality far exceeds the sample size. However, 
this is unsatisfactory, since it does not incorporate the matrix structural information. More 
importantly, the solution is based upon a fundamental assumption that the true underlying 
signal is sparse in terms of the norm of the regression parameters. In matrix regressions, 
however, it is often the case that the true signal is of a low rank structure, or can be well 
approximated by a low rank structure. As such, sparsity is in terms of the rank of the matrix 
parameters, which is intrinsically different from sparsity in the number of nonzero entries. 
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Figure 1: Comparison of the nuclear norm regularized estimation to the classical lasso. 
The matrix covariate is 64 x 64, and the sample size is 500. Top panel: True signal (left), 
nuclear norm regularized estimate with minimal BIC (center), and the classical lasso estimate 
with minimal BIC (right). Bottom panel: BIC along solution paths for the nuclear norm 
regularization (left) and the classical lasso regularization (right). 



To see how such a difference affects signal estimation in matrix regressions, we consider 
the following illustrative example. We generated a normal response Y with mean, fi = ~y t Z+ 
(B, X), and variance one. Z e IR 5 denotes a usual vector of covariates with standard normal 
entries, and 7 = (1, . . . , 1) T . X E ~$i 6 4x64 denotes the matrix covariates, of which all entries 
are standard normal, and B is the coefficient matrix of the same size. B is binary, with the 
true signal region, which is a cross shape in our example, equal to one and the rest zero. The 
inner product between two matrices is defined as (B, X) = (vecB, vecX) = £\ ^ Pi^x^, 
where vec(-) is the vectorization operator that stacks the columns of a matrix into a vector. 
We sampled n = 500 instances {(yi, x i} Zi), % = 1, ...,500}, and our goal is to identify B 



pro 



through a regression of yi on (ccj, Zj). We note that our 
detection or object recognition in imaging processing (IQiu 



em differs 



2005 



rom the usual edge 
20071 ). In our setup, all 



elements of the image X follow the same distribution. The signal region is defined through 
the coefficient image B and needs to be inferred from the association between Y and X 
after adjusting for Z. We also note that the total number of entries in B is 4096 = 64 2 , and 
the number of nonzero ones is 240 (about 5.8%). We applied two approaches. The first is 
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lasso to the vectorized X. That is, we solve the optimization problem 



1 n 

in ~ ^2(Vi ~ 7 T ^ - (B, x l )) 2 + A||vecB||i, 



mm 
b 2 

i=i 

where ||veoB||i is the l\ norm of the vectorized B, and A is the regularization parameter. 
The lower right panel of Figure [1] displays the Bayesian information criterion (BIC) along 
the lasso solution path, which suggests a model with the maximum number of predictors 
(500) that is allowed given the sample size. The parameter estimate under this model is 
shown in the upper right panel, which appears far away from the truth. The second solution 
we consider is penalizing the nuclear norm of B, i.e., we solve 

1 n 

j=i 

where the nuclear norm = <J j(B), and <x,-(-B)'s are the singular values of the matrix 

B. The nuclear norm ||-B||* is a suitable measure of the "size" of a matrix parameter, and 
axation of rank(S) = ||cr(S)|| . This is analogous to the i\ norm for a vector 



is a convex re 



( IRecht et al. 



20101 ) . The lower left graph of Figure [T] displays the BIC along the solution 



path of the nuclear norm penalized matrix regression, and the upper middle panel shows the 
corresponding estimate with minimal BIC. It is clearly seen that the nuclear norm estimate 
achieves a substantially better rec overy than the lasso estimate. One might argue that 
fused lasso ( ITibshirani et al.l . 120051 ) might give a better recovery of such piecewise constant 
signals. However, there are numerous low rank signals, e.g., (01 . . . 01) T (10 . . . 10), which are 
extremely non-smooth and would fail fused lasso. 

More generally, in this article, we propose a family regularized regression models with 
matrix covariates based on spectral regularization. Our contributions are multifold. First, 
we employ a spectral regularization formulation and integrate within a generalized linear 
model (GLM) framework. The resulting model works for a variety of penalization functions, 
including lasso, elastic net, SCAD, and many others, as well as different types of response 
variables, including normal, binary and count outcomes. Second, we develop a highly ef- 
ficient and scalable Nesterov algorithm for model estimation with explicit, non- asymptotic 
convergence rate. We emphasize that such a highly scalable algorithm is critical for analyz- 
ing large-scale and ultrahigh-dimensional matrix data. Third, we derive the effective degrees 
of freedom of selected models, which is crucial for tuning of the regularization parameter. 
The result can be viewed as an extension of the degrees of freedom development from the 
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classical lasso model ( Zou et al. 



20071 ) and the group lasso model (I Yuan and Linl . 120061 ) to 



the generalized linear matrix model. On the other hand, our proposal is not simply another 
variant of lasso and alike. We aim at matrix regression problems, which are important in 
imaging and other scientific applications but have received relatively little attention. 

Our proposal is related to but also distinct from two recent developments involving matrix 
data. The first is a recent proposal of a fa mily of generalized linear models with matrix or 
tensor (multidimensional arrays) covariates ( iZhou et al.l . 120121 ) . The basic idea is to impose a 
particular low rank structure (CP decomposition) on B and then introduce a sparse penalty 
on the coefficients of B. That solution fits the model at a fixed rank of the matrix/tensor 
regression parameters, and thus corresponds to the hard thresholding in the classical vector 
covariate case. In contrast, our solution does not fix the rank of the parameters and is a 
soft thresholding procedure. Moreov er, even wh e n usi ng a convex penalty function such 
as the lasso penalty, the approach of IZhou et al.l ( 120121 ) involves a challenging non-convex 



optimizati on task, whereas the so lution in this paper remains a convex problem. We also 
note that 



case of 



Hun g and Wand ( 120111 ) considered matrix logistic regression, which is a special 



Zhou et al. 



(120 121 ). and they did not investigate any sparsity regularization. The 



second related work is the line of researc h in matrix completion, w 



regularization h as been widely employed (ICandes and Recht 



2009 



lere nuclear norm type 



Mazumder et al 



2010 



Cai et al. 



2010l ). However, the two approaches are different in that, the matrix completion 



problem aims to recover a low rank matrix when only a small portion of its entries are 
observed., whereas our approach concerns about regressions with matrix covariates. 

The rest of the article is organized as follows. We formulate the spectral regularization 
for matrix regression in Section |2j and develop a highly scalable algorithm for the associated 
optimization in Section [3j We derive the degrees of freedom formula in Section HJ and 
investigate the numerical performance of the proposed method in Section [5j We conclude 
the paper with a discussion of potential future research in Section |6j We delegate all the 
technical proofs to the Appendix. 



2 Spectral Regularization 

We first fix the notations. For any matrix B € JR PlXp2 , a(B) = (a 1 (B), . . . ,a q (B)), q = 
min{pi,p 2 }> denotes the vector of decreasingly ordered singular value mapping of B. That 
is ax{B) > a 2 (B) > ...> a r (B) > a r+1 (B) = ... = a q (B) = 0, where r = rank(S). Let 
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Y denote the response variable, Z G IR P0 the vector covariate, X G TR P1 xp2 the 2D matrix 
covariate, and (y, x, z) their sample instances. 

We consider the generalized linear model setup, where Y belongs to an exponential family 
with probability mass function or density 

f 1/0 -6(0) / ,xl 
P{y\x, z) = exp <^ — — + c(y, 0) } , 



(1) 



a{4>) 

and the first conditional moment is E(Y \X, Z) — fi — b'(6), and /i is of the form 

g( f x) = r ] = 1 T Z+(B,X), 



where g is a known link function, 7 G IR po , and .B G ]R ? ' jX? ' 2 . For simplicity, in our subse- 
quent development of the regularized matrix model estimation, we drop the vector covariate 
Z and its associated parameter 7. However, the results can be extended straightforwardly to 
incorporate Z and 7. Also, we only consider GLM with a univariate response, whereas ex- 
tensions to more complex models such as quasi-likelihood models and multivariate responses 
are straightforward. 

For generality in terms of penalty, we consider the spectral regularization problem 



min h(B) = £(B) + J(B), 



(2) 



where £(B) is a loss function; for the GLM, we use the negative log-likelihood as the loss. 
J(B) = f o cr(-B), where / : H 9 — > 1R is a function of the singular values of B. The choice 
f(w) = X Y^j=i \ w j\ an d the least squares loss corresponds to the special case of the nuclear 
norm regularization problem we considered in the Introduction. In general, for sparsity of 
the spectrum, / takes the general form 

<? 

/(to) = ^P,(H,A), 

where P is a scalar penalty function, 77 is the parameter indexing the penalty family, and A 
is the tuning constant. We list some commonly used penalty functions below. 



Power family (IFrank and Friedman! . 119931 ) 



P 11 {\w\,X) = \\w\^ v e (0,2]. 



Two i mportant special c ases of this family are the lasso pen alty when 77 = 1 (jT ibs 



1996 



Chen et all 120011 ) and the ridge penalty when 77 = 2 (IHoerl and Kennardl . 



mram 



1970). 
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Elastic net ( jZou and Hastid . 120051 ) 

P v (\w\,X) = X {(ti - l)w 2 /2 + (2- V )\w\}, T} E [1, 2]. 
Varying rj from 1 to 2 bridges the lasso to the ridge penalty. 



Log penalty (jCandes et al. 



2008 



Armagan et al.l . 120111 ) 



P n (\w\,\) = \\n(r)+ \w\), r)>0. 



SCAD fjFan and Lil . l200ll ). in which the penalty is defined via its partial derivative 

jLp„(M, A) = A jl {H < A} + ( ^~ l 1 ^ )+ l{k[>A}} , V > 2- 

Integration shows SCAD as a natural quadratic spline with knots at A and r)X 

{X\w\ \w\ < X 

X 2 + vMM^l_^ \w\e[X,ij\). 
A 2 (r?+l)/2 \w\>r)X 
For small signals \w\ < A, it acts as lasso; for large signals \w\ > r]X, the penalty 
flattens and leads to the unbiasedness of the regularized estimate. 



• MC+ penalty ((Zhangj, |2010| ). which is similar to SCAD and is defined by the partial 
derivative 

5R^W,a) = a(i-M 

Integration shows that the penalty function 

(w 2 \ X 2 r] 
M w \ - 2~ ) 1 {l«l<Ai7} + — MM^Xt,}, V > 0, 

is quadratic on [0, Xrj] and flattens beyond Xi]. Varying rj from to oo bridges hard 
thresholding (£ regression) to lasso (£i) shrinkage. 

We also comment that, besides the above sparsity penalties, other forms of regularization 
can be useful, depending on the scientific question of interest. For instance, the choice 
f(w) = XY^ 9 jZ\ \ w j ~ w j+i\ = ^{ w i ~ w r) produces the regula rization for the "spiked" 
matrix model, i.e., matrices with clustered eigen-/singular values (jJohnstond . 1200 ll ). 

Convexity is essential for studying convergence properties of optimization problems. We 
first state the necessary and sufficient condition f or the convexity of the re gularizer J. Its 
proof follows from the theory of spectral function (IBorwein and Lewis! . 120061 ) and is given in 
the Appendix. 
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Lemma 1. The functional J{B) = foa(B) is convex and lower semicontinuous if and only 
if f is convex and lower semicontinuous. Furthermore, for a convex f , the subdifferential of 
J at B, which admits singular value decomposition Udia.g{b)V T , is 

dJ{B) = d{f o a){B) = Udi&g[df{b)]V T . 

Lemma [I] immediately leads to the optimality condition when both loss and regularizer are 
convex. 

Theorem 1. When both the loss I and f are convex, all local minima of the regularized pro- 
gram §2§ are global minimum and is unique if I is strictly convex. A matrix B = C/diag(6) V T 
is a global minimum if and only if 

P1XP2 G Vl(B) + Udmg[df(b)]V T . 

When either the loss I or / is non-convex, the regularized objective function ([2]) may be 
non-convex and there lacks an easy-to-check optimality condition. 



3 Estimation Algorithm 



1983 . 



2004 ) for min- 



We utilize the powerful Nesterov optimal gradient method (jNesterovl . 
imizing the non-smooth and possibly non-convex objective function ([2]). We first state a 
matrix thresholding formula for spectral regularization, which forms the building blocks of 
the Nesterov algorithm. 



Proposition 1. For a given matrix A with singular value decomposition A = C/diag(a) V T , 

the optimal solution to 

mm^\\B - A\\ 2 F + f o a(B) 

shares the same singular vectors as A and its ordered singular values are the solution to 



min-||6-a||| + /(&). 

b Z 

An immediate consequence of Proposition [1] is th e following well- 



olding formula for nuclear norm regularization (ICai et al.l . |2010| ) . 



mown singular value thresh- 
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Corollary 1. For a given matrix A with singular value decomposition A = t/diag(a) V T , 
the optimal solution to 



1 



mm — I LB 

B 2" 



^IIf + 



\B\ 



shares the same singular vectors as A and its singular values are bi = (aj — A) + . 

Given the above matrix thresholding formula, we are ready to present the Nesterov algo- 
rithm for minimizing ([2]). The Nesterov method has attracte d increasing attention in re cent 
years due to its efficiency in solving regularization problems (IBeck and Teboulld . l2009bl ). It 
resembles the classical gradient descent algorithm in that only the first order gradients of 
the objective function are utilized to produce next algorithmic iterate from current search 
point, and as such is simple to implement. It differs from the gradient descent algorithm by 
extrapolating the previous two algorithmic iterates to generate the next search point. This 
extrapolation step incurs trivial computational cost but improves the convergence rate dra- 
matically. It has been s hown to be optimal am ong a wide class of convex smooth optimization 
problems (jNemirovskil . 1 19941 ; iNesterovl . 120041 ) . 

We summarize our Nesterov method for solving spectral regularization problem fl2]) in 
Algorithm [TJ Each iteration consists of three steps: (i) predict a search point S by a linear 
extrapolation from previous two iterates (line|3]of Algorithm [1]), (ii) perform gradient descent 
from the search point S possibly with Armijo type line search (lines l4"] fT0]) . and (iii) force 
the descent property of the next iterate (lines [TTlfT5]) . 

In step (i), oft' is a scalar sequence that plays a critical role in the extrapolation. We 
update this sequence as in the original Nesterov method (line [16] of Algorithm [1]) , whereas 
other sequences, for instance a^' = (£ — 1)/(£ + 2), can also be used. In step (ii), the gradient 
descent is based on the first order approximation to the loss function at the current search 
point S w , 



g{B\S®,8) 



e(S {t) ) + (W(S W ),B - S w > + —\\B - S^\\ 2 F + J(B) 

2o 

L\\ B _ [£« _ 8V£{S®)]\\1 + J(B) + c®, 
2d 



where the constant 5 is determined during the line search and the constant collects 
terms irrelevant to the optimization. The "ridge" term (25)~ 1 \\B — S^\\p acts as a trust 
region and shrinks the next iterate towards If the loss function i e Ci t i, which denotes 
the class of functions that are convex, continuously differentiable and the gradient satisfies 
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1 Initialize B (0) = B«, 5 > 0, a* ' = 0, a« = 1 ; 

2 repeat 



4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 



«(') 



repeat 

A tcmp <- BW - 5V£(5W) ; 
Compute SVD A temp = Udiag(a)V T ; 

^— argmin 3 ,(25) _1 ||a; — a||| + /(a?) ; 
B tcmp <- E/diag(6)V T ; 
5 <- 5/2 ■ 
until h(B tciap ) < g(B tcmp \ S w ,<5) ; 
if h(B temp ) < h(B®) then 



B^ <- B 



else 



temp j 



end 

<- (1 + + (2a(*)) 2 )/2 



17 until objective value converges; 



Algorithm 1: Nesterov method for spectral regularized matrix regression (J2J). 



||V£(it) — V£(v)|| < C(£) \\u — v || with a known gradient Lipschitz constant C(£) for all u,v, 
then 5 is fixed at C(l)~ l . In practice, the gradient Lipschitz constant is often unknown. 
Then 5 is up dated dynamically to capture the unknow n £(£) using the classical Armijo line 



search rule (INocedal and Wright 



2006 



Langd . 120041 ) . Solution to the surrogate function 



g(B\S^\ S) is obtained by Proposition [TJ Singular value decomposition is performed on 
the intermediate matrix A tcmp = B« — SVi(S^). The next iterate B^ +1 ^ shares the 
same singular vectors as A and its singular values 6^ +1 - ) are determined by minimizing 
(2<5) _1 ||6 — o-Hl + f{b), where a = a(A temp ). For a nuclear norm regularization f(w) = 
\ w i\> ^ ne solution is given by soft thresholding the singular values fcf = (a, — X5) + . 
In this s pecial case, only the top sing ular values/vectors need to be retrieved. The Lanczos 
method ( jGolub and Van Loan! 119961 ) is extremely efficient for this purpose. For a linear 
regularization function f(w) = X\\Dw\\i where D has full column rank, reparameterization 
c = Db turns the problem to: min c ^\\(D T D)~ 1 D T c — a\\\ + A||c||i, which is a standard 
lasso problem with many efficient solvers available. When D does not have a full column 
rank, we append extra rows such that the expanded matrix, denoted by D, has full column 
rank and then solve the above lasso problem with D and only part of c penalized. 

For minimization of a smooth convex function I in Ci.i, it is well-known that the Nesterov 
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method is optimal with the convergence rate at order 0(t 2 ), where t indicates the iterate 
number. In contrast, the gradient descent has a slower convergence rate of 0{t~ l ). Our 
nuclear norm regularization problem (J2J) is non-smooth, but the same convergence result can 

heorem [2j Its proof is omitted for brevity, while 



be established, which is summarized in 
readers are referred to 



Beck and Teboullel (12009 W l. 



Theorem 2. Suppose I is continuously differentiable with a gradient Lipschitz constant C(i). 
Let be the iterates generated by the Nesterov method described in Algorithm^ Then the 
objective value h(B^) monotonically converges. Furthermore, if J is convex, then 



ft(S <«») - h(B") < ^>-B % (3) 

for all t > and any minimum point B* . 

We make a few important remarks here. The first remark regards the monotonicity of 
the objective function during iterations. Because of the extrapolation step, the objective 
values of algorithmic iterates f(B^) are not guaranteed to be monotonically decreasing. 
When the loss t G Ci t i and the regularizer J is convex, convergence of the objective values is 
guaranteed with the explicit convergence rate ([3]) . Because of potential use of a non-convex 
J, we enforce monotonicity of algorithmic iterates (lines [TTHT51 in Algorithm [1]), which is 
essential for the convergence of at least the objective values. After each gradient descent 
step, if the new iterate fails to decrease the objective value, then the current iterate is same 
as the previous one. In other words, the next gradient descent is initiated from the previous 
iterate. Fortunately, the fast convergenc e rate ([3]) still holds under the assumptions £ G C\\ 



and J is convex. See 



Beck and Teboullel (j2009al ) for the argument. 



The second remark is about the non-convex loss function. The Nesterov method and its 
convergence properties hinge upon convexity of the loss I. It covers many commonly used 
statistical models, including linear model and GLMs with canonical links. For GLM with 
non-canonical link, the loss function may be non-convex, which could cause trouble in the 
Nesterov method. The iteratively reweighted least squares strategy can be applied in this 
scenario. At each IWLS step, the Nesterov method is used to solve the penalized weighted 
least squares problem, which is convex. 

The final remark is about an efficient way for estimating the Lipschitz constant L for the 
GLM loss. Each step halving in the line search part of Algorithm [T] involves an expensive 
singular value decomposition. Therefore even a rough initial estimate of L potentially cuts 
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the computational cost significantly. Recall that a twice differentiable function / is L- 
Lipschitz continuously differentiable if and only if v T d 2 f(u)v < L\\v\\ 2 for all v. The Fisher 
information matrix of a GLM model with systematic part ([T]) is: 1(B) = E[d 2 £(B)] = 
Y^i = iUi(yecXi)(YecXi) T , where Ui = {/i-(r]j)/o"j} 2 , rji is the systematic part, /i« is the mean, 
and a 2 is the variance corresponding to the ith observation. Then in light of the Cauchy- 
Schwartz inequality 

v T I(B)v = u>i(v T vecXi)(v T vecXi) T < oji||veca;i|||||v||| = ||u||2 I /J^ill 3 ^!!]? I > 

i i \ i / 



and thus an initial estimate of L is given by L w YlAl^iiVi)} 2 1 °1 II^IIf- 

4 Degrees of Freedom 

In this section, we address the problem of choosing the tuning parameter A that yields the 
best model along the regularization path according to certain criteria. Cross validation is 
commonly used for parameter tuning in practice. However, for large data, it may incur 
considerable computation burden. Ther e exist c o mput ationally attractive alternatives, such 
as Aka ike information criterion (AIC) ( Akaike . 197^) and Bayesian information criterion 



(BIC) (jSchwarzl . Il978l ). which often yield performance comparable to cross validation in 
practice. 

Consider a normal model under the GLM ([1]). For simplicity, we again drop the covariate 
vector Z: 

Y=(X,B) + e (4) 

where e is a normal error with mean zero and variance a 2 . Let yi denote the ith observation 
of Y, and yi(X) denote the estimated response under a given tuning parameter A from the 
minimization of ([2]). Then for this normal model, AIC and BIC are defined by 

AIC(A) = ^ fa ~/* (A)}2 +2df(A) 
a 2 

BIC(A) = Eii^zM^t +l n (n)df(A). 
a 2 

In applications, the variance a 2 is often unknown but can be estimated from the fitted value 
by least squares estimation. An essential element in the above model selection criteria is the 
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effective degrees of freedom df(A) of the selected model. Using Stein's theory of unbiased 
risk estimation (jSteinl . Il98lf ). lEfronl ( 120041 ) showed that 



df(A) = E { tr (^)} = ^ COV(ft(A),M 



with expectation taken with respect to Y, y — (yi, . . . ,y n ) T , and y = ($i(A), . . . ,y n (X)) T . 
This formulation has been productively used t o derive the deg rees of freedom estimate in 



l east angle regression (jEfron et al. 



(lYuan and Linl . 



20041 ). lasso (IZou et al. 



20071 ). gro up penalized regressio n 



20061 ) , and sign-coherent group penalized regression (IChiquet et al.l . 1201 ll ) . 



We derive a degrees of freedom estimate for the nuclear norm regularized estimate under 



normal model. For an orthonormal design, i.e., S T H 



with the matrix E having rows 



(vec£Cj) T , the derived estimate is unbiased for the true degrees of freedom. In practice, it 
yields results comparable to cross validation even for non-orthogonal designs. The technical 
proof is relegated to the Appendix. 



Theorem 3. Assume that the data is generated from model 
Consider the nuclear norm regularized estimate 



with vecx; orthonormal. 



B x = argmin B i ^(y, - (X t , B)f + A||B| 



with singular values cr(Bx) = (&i(A), . . . , b q (X)) where q = min-jjOx,^}- Let B LS be the usual 
least squares estimate and assume that it has distinct positive singular values a% > ■ ■ ■ > o q > 
0. With the convention <Tj = for i > q, the following expression is an unbiased estimate of 
the degree of freedom of the regularized fit 



df(A) = Mm»>0} 1 + 



E 



Oi{oi - A) 



i=i 



E 



Pi Wi 



A) 



<7J 



0~: 



This formula for the degrees of freedom is interesting in several aspects. First it does 
not involve any information on the singular vectors of the least squares estimate, but only 
requires singular values. Second, df(A) is continuous in A, in contras t to t he piecewise 
constant degrees of freedom estimate for the classical lasso (IZou et al.l . 120071 ). Third, at 
A = 0, B = B LS almost surely has a full rank and the degrees of freedom is 



E i+ E 



07 



i=l 



1<7<P1 ,j¥=i 



E 

l<j<P2,j^l 



<7J 



q + q{q-l) + q{pi + P2 - 2q) = pip 2 , 
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5000 




Figure 2: Degrees of freedom estimate df (A) versus the number of parameters in the estimated 
model B x with a B LS e IR 64 * 64 . 



which is exactly the number of parameters without any regularization. 

Figure [2] plots the estimated degrees of freedom df(A) for a B^s £ lR 64x64 , as well as 
the naive count of the number of free parameters in the fitted matrix parameter B\ of rank 
r(A), which equals r(X)(pi + P2) — r 2 (X). The estimated degrees of freedom appears smaller 
than the naive count, reflecting the overwhelming shrinkage effect over model searching. At 
A = 0, it coincides with the number of matrix elements 64 2 = 4096, reflecting the effect of 
no shrinkage. 

Finally we notice that the degrees of freedom estimate in Theorem [3] is limiting as it 
requires existence of the least squares estimate -Bls which is not true when n < p\P2- In this 
case we may use a ridge estimate -B ridge ( r ) where 

vecSridgeCr) = (S T H + r/p lP2 ) _1 H T y, 

which always exists and is unique. Assume that B vidge ( T ) admits a singular value decompo- 
sition -B ridge(r ) = Udiag(er) V J . The following degree of freedom formula 



df(r) 



)>0} 



1=1 



1 



1 + r 



E 



<7i{(l + r)<Ti- A} 



1<?<P1 jV* 



1 



E 



<7i{(l + r)<7i - A} 



generalizes Theorem [3] and is unbiased for the true degree of freedom under the same as- 
sumptions as Theorem [3j Its proof is given in the Appendix. 
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5 Numerical Examples 



We have conducted intensive numerical studies with two aims: first, we investigate the 
empirical performance of the proposed spectral regularized regression with matrix covariates, 
and second, we compare with the corresponding classical regularization solutions. Four 
methods are under comparison: a matrix regression with nuclear norm regularization (since 
it takes the form f(w) = X Ylj=i \ w j\ * n ^ ne regularization problem fl2]), we call this solution 
matrix lasso, or simply, m. lasso), a usual vector regression after vectorizing the matrix 
covariate with a lasso penalty (lasso), a matrix regression with power spectral regularization 
(matrix power, or simply, m. power), and the corresponding vector regression with a power 
penalty (power). For the power penalty, we fixed the coefficient rj = 0.5. Our goal here is not 
to best tune rj. Instead, we examine this penalty since it yields a nearly unbiased estimate. 
We also examined another unbiased penalty, SCAD, which yields very similar results as 
power 7] = 0.5, and thus its results are not reported here for brevity. We also note that the 
lasso penalty is a convex penalty, while the power rj = 0.5 is non-convex. We summarize our 
findings in three examples. First, we elaborated on the illustrative example by examining a 
number of different geometric and natural shapes. Second, we generated some synthetic data 
and compared different regularization solutions under varying ranks and sparsity. Lastly, 
we revisited the motivating electroencephalography (EEG) data analysis mentioned in the 
Introduction. 

Example 1: 2D Shapes. We elaborated on the illustrative example in the Introduction, by 
employing the same model setup, but examining a variety of signal shapes. We presented the 
true signal followed by the estimates from the four aforementioned regularization methods, 
where the regularization parameter A was tuned by BIC. The signals of square, cross, T- 
shape are given in Figure [3j where the true signal is of a low rank structure (rank 1 or 2). 
The signals of triangle, disk and butterfly are given in Figure HI where the t rue signal is 



not o f an exact low rank structure, however, can be well approximated by so (IZhou et al. 



20121 ). It is seen that the matrix version of regularized estimators clearly outperform their 



vector version counterparts, for both lasso and power penalties. Comparing matrix lasso 
with matrix power, the two yield comparable results, while the former is better for the high 
rank signals, and the latter is better for the low rank signals. We will further verify this 
observation in the next simulation example. 
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Figure 3: 
signals. 



Comparison of the matrix and vector version of regularized estimators for low rank 
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True Signal Matrix Lasso Estimate Lasso Estimate 
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Figure 4: Comparison of the matrix and vector version of regularized estimators for high 
rank signals. 
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Table 1: Parameter estimation of a normal model. Reported are the mean and standard 
deviation (in the parenthesis) of the RMSE for B out of 100 data replications. 



Sparsity 






Rank 




s 


Method 


R= 1 


R = 5 


R= 10 


R = 20 


1% 


m. lasso 
lasso 
m. power 
power 


0.031 (0.006) 
0.022 (0.017) 
0.010 (0.001) 
0.009 (0.025) 


0.074 (0.016) 
0.022 (0.010) 
063 (0 021) 
0.005 (0.001) 


0.086 (0.015) 
0.023 (0.013) 
097 (0 021) 
0.007 (0.017) 


0.090 (0.013) 
0.022 (0.006) 
0.107 (0.017) 
0.006 (0.001) 


5% 


m. lasso 
lasso 
m. power 
power 


0.044 (0.007) 
0.212 (0.049) 
0.010 (0.001) 
0.270 (0.060) 


0.172 (0.021) 
0.214 (0.035) 
0.171 (0.029) 
0.271 (0.042) 


0.199 (0.022) 
0.215 (0.032) 
0.236 (0.029) 
0.272 (0.039) 


0.212 (0.020) 
0.217 (0.026) 
0.254 (0.024) 
0.275 (0.032) 


10% 


m. lasso 
lasso 
m. power 
power 


0.051 (0.008) 
0.320 (0.046) 
0.009 (0.001) 
0.399 (0.056) 


0.250 (0.028) 
0.343 (0.040) 
0.258 (0.041) 
0.427 (0.050) 


0.288 (0.023) 
0.343 (0.034) 
0.351 (0.033) 
0.428 (0.044) 


0.309 (0.023) 
0.345 (0.030) 
0.375 (0.032) 
0.426 (0.036) 


20% 


m. lasso 
lasso 
m. power 
power 


0.062 (0.011) 
0.475 (0.050) 
0.009 (0.001) 
0.585 (0.062) 


0.350 (0.031) 
0.523 (0.050) 
0.354 (0.041) 
0.645 (0.060) 


0.418 (0.031) 
0.539 (0.049) 
0.511 (0.039) 
0.667 (0.060) 


0.450 (0.031) 
0.534 (0.044) 
0.562 (0.044) 
0.663 (0.055) 


50% 


m. lasso 
lasso 
m. power 
power 


0.087 (0.015) 
0.775 (0.044) 
0.010 (0.001) 
0.952 (0.057) 


0.556 (0.030) 
1.054 (0.081) 
0.493 (0.042) 
1.300 (0.107) 


0.700 (0.044) 
1.112 (0.089) 
0.760 (0.055) 
1.365 (0.113) 


0.792 (0.040) 
1.130 (0.069) 
0.903 (0.062) 
1.388 (0.082) 



Example 2: Synthetic Data. We consider a class of synthetic signals to compare various 
regularization methods under different ranks and sparsity levels. Specifically, we generated 
the matrix covariates X of size 64 x 64 and the 5-dimensional vector covariates Z, both 
of which consist of independent standard normal entries. We set the sample size n = 500, 
whereas the number of parameters is 64 x 64 + 5 = 4, 101. We set 7 = (1, ... , 1) T , and 
generated the true array sig nal as B = B X B\, where B d G IR PdXfi , d = 1,2. R controls 
the rank of the generated signal. Moreover, each entry of B is 0/1, and the percentage of 
non-zero entries is controlled by a sparsity level constant s. That is, each entry of Bd is a 
Bernoulli with probability of one equal to yl — (1 — s) 1 /^. We varied the rank R — 1, 5, 10 
and 20, and the level of (non)sparsity s = 0.01,0.05,0.1,0.2 and 0.5. (So s = 0.05 means 
about 5% of entries of B are ones and the rest are zeros.) We generated both a normal and 
a binomial response Y with the systematic part as in ([I]), \i — 77 for the normal model, and 



18 



Table 2: Prediction of a normal model. Reported are the mean and standard deviation (in 
the parenthesis) of the RMSE for y out of 100 data replications. 



Sparsity Rank 

s Method R = 1 R = 5 R = 10 R = 20 

1% m.lasso 1.707 (0.195) 4.597 (1.091) 5.525 (1.009) 5.865 (0.826) 
lasso 1.610 (0.978) 1.544 (0.487) 1.622 (0.762) 1.530 (0.233) 
m.power 1.175 (0.042) 3.996 (1.164) 5.534 (1.047) 5.992 (0.874) 
power 1.236 (1.240) 1.133 (0.742) 1.168 (0.962) 1.066 (0.039) 

5% m.lasso 2.278 (0.309) 11.031 (1.418) 12.837 (1.578) 13.754 (1.352) 

lasso 13.488 (2.923) 13.625 (2.115) 13.688 (2.047) 13.855 (1.609) 





m.power 
power 


1.177 (0.043) 
14.713 (3.384) 


10.253 (1.479) 
14.877 (2.109) 


13.013 (1.593) 
14.933 (2.018) 


14.118 (1.407) 
15.113 (1.597) 


10% 


m.lasso 
lasso 
m.power 
power 


2.542 (0.420) 
19.601 (2.661) 
1.179 (0.045) 
21.138 (2.879) 


16.101 (1.940) 
21.257 (2.488) 
15.386 (2.161) 
22.796 (2.782) 


18.590 (1.568) 
21.165 (2.052) 
18.840 (1.588) 
22.708 (2.323) 


19.810 (1.514) 
21.187 (1.742) 
20.239 (1.597) 
22.546 (1.776) 


20% 


m.lasso 
lasso 
m.power 
power 


3.188 (0.655) 
28.687 (3.017) 
1.169 (0.042) 
30.879 (3.284) 


22.723 (2.157) 
31.862 (3.219) 
21.401 (2.309) 
34.350 (3.495) 


26.783 (2.264) 
32.588 (3.182) 
27.103 (2.171) 
35.013 (3.600) 


28.874 (2.256) 
32.449 (2.767) 
29.271 (2.189) 
34.998 (2.994) 



50% m.lasso 4.566 (1.026) 35.651 (2.153) 45.132 (2.989) 50.916 (3.054) 

lasso 45.815 (2.749) 62.834 (4.896) 66.490 (5.523) 67.478 (4.444) 

m.power 1.180 (0.047) 29.638 (2.079) 42.104 (2.381) 48.892 (2.945) 

power 49.798 (3.091) 68.073 (5.719) 71.626 (6.063) 72.729 (4.986) 



jj, — 1/{1 + exp(— i])} for the binomial. 

We evaluated the performance of each method from two aspects: parameter estimation 
and prediction. For the former, we employed BIC for parameter tuning and the root mean 
squared error (RMSE) as the evaluation criterion. For the latter, we used an independent 
validation data set to tune the regularization parameter and a testing data set to evaluate 
the prediction error, which is measured by the RMSE of the response for the normal case, 
and the mis-classification error rate for the binomial case. Those are all common practices in 
the literature. We replicated the experiment 100 times, and report the mean and standard 
deviation of the corresponding criterion out of 100 replications in Tables [U - HI The best 
outcomes are highlighted in bold face. Since the RMSE for the vector-valued coefficient 7 
shows the same qualitative pattern as that for the array coefficient B, we only present the 
results for B for brevity. 
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Table 3: Parameter estimation of a binomial model. Reported are the mean and standard 
deviation (in the parenthesis) of the RMSE for B out of 100 data replications. 



Sparsity 






Rank 




s 


Method 


R= 1 


R = 5 


R=10 


R = 20 


1% 


m. lasso 
lasso 
m. power 
power 


0.087 (0.023) 
0.095 (0.024) 
0.077 (0.020) 
0.095 (0.024) 


0.094 (0.021) 
0.097 (0.022) 
0.091 CO 020) 
0.097 (0.022) 


0.096 (0.015) 
0.097 (0.016) 
0.094 CO 015) 
0.098 (0.016) 


0.094 0.014) 
0.095 0.014) 
0.093 (0.013) 
0.095 0.014) 


5% 


m. lasso 
lasso 
m. power 
power 


0.210 (0.042) 
0.220 (0.042) 
0.193 (0.042) 
0.221 (0.042) 


0.222 (0.032) 
0.225 (0.032) 
0.216 (0.032) 
0.225 (0.032) 


0.231 (0.027) 
0.233 (0.027) 
0.226 (0.027) 
0.233 (0.027) 


0.231 (0.021) 
0.233 (0.021) 
0.228 (0.021) 
0.233 (0.021) 


10% 


m. lasso 
lasso 
m. power 
power 


0.304 (0.043) 
0.315 (0.043) 
0.287 (0.043) 
0.315 (0.043) 


0.333 (0.035) 
0.336 (0.035) 
0.326 (0.035) 
0.337 (0.035) 


0.339 (0.031) 
0.342 (0.031) 
0.334 (0.031) 
0.342 (0.031) 


0.331 (0.028) 
0.333 (0.028) 
0.327 (0.028) 
0.333 (0.028) 


20% 


m. lasso 
lasso 
m. power 
power 


0.438 (0.044) 
0.449 (0.044) 
0.419 (0.044) 
0.449 (0.044) 


0.502 (0.047) 
0.506 (0.047) 
0.494 (0.047) 
0.507 (0.047) 


0.510 (0.038) 
0.513 (0.038) 
0.504 (0.038) 
0.513 (0.038) 


0.515 (0.039) 
0.517 (0.039) 
0.510 (0.038) 
0.518 (0.039) 


50% 


m. lasso 
lasso 
m. power 
power 


0.695 (0.038) 
0.706 (0.038) 
0.676 (0.039) 
0.706 (0.038) 


0.990 (0.072) 
0.997 (0.073) 
0.979 (0.072) 
0.997 (0.073) 


1.044 (0.076) 
1.049 (0.077) 
1.034 (0.076) 
1.049 (0.077) 


1.056 (0.057) 
1.061 (0.058) 
1.048 (0.057) 
1.061 (0.058) 



We make the following observations. First, for the normal response, the proposed matrix 
version of estimators almost always outperform the corresponding vector version in terms of 
both parameter estimation and prediction, and this holds true for both the lasso and power 
penalties. The only exception is that when the signal is extremely sparse (s = 1% of non-zero 
elements), where the usual vector version of regularized estimators perform slightly better. 
Second, for the binomial response, the matrix version is superior than the vector version for 
all ranks and sparsity levels, and such a superiority is more striking in terms of predicting 
the binary outcome. Third, as for the effects of the signal rank and sparsity, the regular 
version of regularized estimators perform better when the signal is more sparse (a smaller 
s), while is relatively insensitive to the rank (R). By contrast, the proposed matrix version 
of regularized estimators perform better when the rank is smaller, and is insensitive to the 
coefficient sparsity. Those patterns agree with our expectations since the former penalizes 
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Table 4: Prediction of a binomial model. Reported are the mean and standard deviation (in 
the parenthesis) of the misclassification error for y out of 100 data replications. 



Sparsity 






Rank 




s 


Method 


R= 1 


R = 5 


R=10 


R = 20 


1% 


m. lasso 
lasso 
m. power 
power 


0.231 (0.029) 
0.411 (0.038) 
0.252 (0.028) 
0.763 (0.075) 


0.337 (0.030) 
0.416 (0.037) 
346 (ft 027") 
0.764 (0.071) 


0.362 (0.024) 
0.413 (0.035) 
370 (ft 025") 
0.777 (0.082) 


0.372 (0.024) 
0.415 (0.031) 
0.382 (0.029) 
0.771 (0.068) 


5% 


m. lasso 
lasso 
m. power 
power 


0.199 (0.021) 
0.459 (0.022) 
0.220 (0.025) 
0.853 (0.091) 


0.363 (0.023) 
0.460 (0.025) 
368 (ft 023") 
0.854 (0.088) 


0.393 (0.022) 
0.460 (0.030) 
398 (ft 024^ 
0.824 (0.080) 


0.409 (0.026) 
0.461 (0.023) 
0.416 (0.025) 
0.850 (0.080) 


10% 


m. lasso 
lasso 
m. power 
power 


0.194 (0.023) 
0.469 (0.022) 
0.217 (0.027) 
0.864 (0.079) 


0.359 (0.027) 
0.466 (0.022) 
0.368 (0.027) 
0.865 (0.095) 


0.396 (0.024) 
0.467 (0.023) 
0.404 (0.024) 
0.860 (0.086) 


0.404 (0.027) 
0.462 (0.020) 
0.413 (0.024) 
0.874 (0.076) 


20% 


m. lasso 
lasso 
m. power 
power 


0.193 (0.030) 
0.469 (0.022) 
0.217 (0.030) 
0.886 (0.094) 


0.343 (0.027) 
0.466 (0.025) 
0.355 (0.024) 
0.870 (0.091) 


0.375 (0.027) 
0.470 (0.025) 
0.384 (0.027) 
0.859 (0.089) 


0.396 (0.026) 
0.465 (0.023) 
0.403 (0.023) 
0.858 (0.089) 


50% 


m. lasso 
lasso 
m. power 
power 


0.187 (0.023) 
0.472 (0.019) 
0.214 (0.023) 
0.872 (0.091) 


0.282 (0.033) 
0.471 (0.025) 
0.308 (0.030) 
0.860 (0.089) 


0.312 (0.032) 
0.470 (0.023) 
0.336 (0.028) 
0.869 (0.083) 


0.344 (0.027) 
0.465 (0.023) 
0.359 (0.024) 
0.873 (0.092) 



directly on the coefficient sparsity, whereas the latter penalizes on the rank sparsity. Finally, 
comparing the lasso penalty with the power penalty, the two yield similar results, while the 
lasso usually performs better when the rank is large, and the power penalty is better when 
the rank is small. Such patterns agree with what we observed in Example 1, and can provide 
some useful guidelines when choosing a penalty function given the data. 

Example 3: EEG Data Analysis. We analyzed the motivating EEG data in the Intro- 
duction and compared a number of regularization estimators. The data set consists of 77 
individuals with alcoholism and 44 individuals as the control. For each subject, 64 channels 
of electrodes were placed at different locations of scalp and the voltage values are recorded 
at 256 time points. Besides, each subject performed 120 trials under three types of stimuli: 
single stimulus, two matched stimuli and two unmatched stimuli. One primary interest was 
to study the association between alcoholism and the pattern of voltage values over times and 
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Figure 5: EEG signals of an alcoholic subject (left) and a control subject (right). Each curve 
tracks signals from one electrode. 

channels. Figure |5] displays the EEG signals from a randomly chosen alcoholic subject and 
a control subject, where the horizontal axis is the time, the vertical axis is the voltage, and 
each curve represents an electrode channel. It is clearly seen that the alcoholic and control 
su bjects demonstr ate d istinct characteri s tics. 



Li et al. 



(120101 ) and lHung and Wangi (120111 ) both analyzed this same data set. Following 
their practice, we focused on the data under the single stimulus condition only, and aver- 
aged for all the trials under that condition for each subject. The resulting covariates Xi are 
256 x 64 matrices, and the response y^ is a binary variable indicating whether the ith subject 
is alcoholic (?/, = 1) or not (yi — 0), i — 1, . . . , 122. We applied matrix lasso, lasso, matrix 
power and power to the data, and evaluated each solution via cross-validation based mis- 
classification error. More specifically, we divided the full data into a training and a testing 
sample using fc-fold cross-validation. When k equals the sample size, it is the leave-one-out 
cross-validation. Then for the training data, we further employed a 5-fold cross-validation 
to tune the shrinkage parameter A. We then applied the tuned model that is fully based 
on the training data now to the testing data and evaluated the misclassification error rate 
for testing. Table [5] reports the results for leave-one-out, 5, 10, and 20-fold cross-validation 
results. It is clearly seen that the matrix version of the regularized estimators achieve smaller 
misclassification errors compared to the vector version counterparts. Comparing the lasso 
and the power penalties, the lasso achieves a slightly better classific ation accuracy. 



We also make a few remarks regarding to this data analysis. First. IHung and Wangl (120111 ) 
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Table 5: Misclassification error rate for EEG data. 



Method 


leave-one-out 


5-fold 


10-fold 


20-fold 


m.lasso 


0.230 


0.214 


0.222 


0.181 


lasso 


0.246 


0.287 


0.271 


0.264 


m.power 


0.246 


0.222 


0.228 


0.213 


power 


0.254 


0.329 


0.244 


0.276 



analyzed the sa me data using a m atrix logistic regression, which is a rank-1 special case of the 



tensor GLM of 



Zhou et al 



( 120121 ). Their method does not work for n < p, and thus a ridge 
type regularization was employed. The ridge tuning parameter was chosen so that the leave- 
one-out classification accuracy is maximized, and the corresponding best classification result 
was reported, which is slightly better than the classification accuracy reported in Table [5j 
We believe, however, a fair evaluation of prediction should have the parameter tuning solely 
based on the training data, and then report the corresponding testing error. If one tunes the 
para meter ba s ed on the reported testing error, t he result would be over optimistic. Second, 



both 



Li et al. 



( 120101 ) and lHung and Wand (120111 ) preprocessed the EEG data by reducing the 
dimensionality of the matrices to a smaller scale. Both used principal components analysis 
(PCA) for reduction, but they employed different variants of PCA for matrices. Part of 
reason for the dimension reduction preprocessing was that their proposed numerical methods 
can not directly handle the size of 256 x 64 of the EEG data. By contrast, our proposal 
is not limited by the matrix size and were directly applied to the original data, since our 
Nesterov algorithm is highly efficient and scalable. On the other hand, we agree that such 
preprocessing could potentially improve the overall classification accuracy by removing noisy 
irrelevant information. However, we note that, PCA is known as an unsupervised dimension 
reduction approach that can not guarantee full preservation of all relevant information. 
Moreover, it introduces another layer of tuning, including choosing the right version of PCA 
for matrices, and determining the optimal number of principal components. Our goal of 
this data analysis has primarily been the comparison of the regularized matrix regression 
estimators with the vector counterparts, so we have chosen not to include any preprocessing 
that requires a significant amount of additional tuning. 
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6 Discussion 



Motivated by modern scientific data arising in areas such as imaging and cytometry, we 
study in this article the problem of regressions with matrix covariates. Regularization is 
bound to play a crucial role in such regressions due to the ultrahigh dimensionality and 
complex structure of the matrix data. We have proposed a class of regularized matrix 
regression models by penalizing the spectrum of the matrix parameters. It is based upon the 
observation that the matrix signals are often of, or can be well approximated by, a low rank 
structure. Consequently, the new method focuses on the sparsity in terms of the rank of the 
matrix parameters rather than the number of nonzero entries, and is intrinsically different 
from the classical lasso and related penalization approaches. 

In many applications, sparsity is sought in certain pre-specified basis systems rather than 
the original coordinates. Specifically, the systematic component takes the form 

V = ~f T Z+(B,SlXS 2 ), 

where Sj G IR^** contains qj basis vectors for j = 1,2. For 2D images, over-complete 
wavelet basis (qj > pj) are often used in both dimensions. For the EEG data that are 
channels by time points, wavelet or Fourier basis can be applied to the time dimension. It 
is of direct interest to seek a sparse low rank representation of the signal B in the basis 
system by solving the regularized regression. In that sense, our proposed regularized matrix 
regression can be considered as the matrix analog of the classical basis pursuit problem 



(IChen et all . l200lf ). 

We have concentrated on problems with matrix covariates throughout this article. In 
applications such as anatomical magnetic resonance imaging (MRI) and functional magnetic 
resonance imaging (fMRI), the covariates are in the form of multidimensional arrays, a.k.a. 
tensors (D > 2). It is natural to extend the work here to regularized tensor regressions. 
However, the problem formulation requires an appropriate norm for tensors that is in analo- 
gous to the nuclear norm for matrices, and the involved regularization and optimization are 
to be fundamentally different from the methods for matrices. We are currently pursuing this 
line of extension, and will report the results elsewhere. 
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